27 research outputs found

    Incorporating peak grouping information for alignment of multiple liquid chromatography-mass spectrometry datasets

    Get PDF
    Motivation: The combination of liquid chromatography and mass spectrometry (LC/MS) has been widely used for large-scale comparative studies in systems biology, including proteomics, glycomics and metabolomics. In almost all experimental design, it is necessary to compare chromatograms across biological or technical replicates and across sample groups. Central to this is the peak alignment step, which is one of the most important but challenging preprocessing steps. Existing alignment tools do not take into account the structural dependencies between related peaks that co-elute and are derived from the same metabolite or peptide. We propose a direct matching peak alignment method for LC/MS data that incorporates related peaks information (within each LC/MS run) and investigate its effect on alignment performance (across runs). The groupings of related peaks necessary for our method can be obtained from any peak clustering method and are built into a pairwise peak similarity score function. The similarity score matrix produced is used by an approximation algorithm for the weighted matching problem to produce the actual alignment result.<p></p> Results: We demonstrate that related peak information can improve alignment performance. The performance is evaluated on a set of benchmark datasets, where our method performs competitively compared to other popular alignment tools.<p></p> Availability: The proposed alignment method has been implemented as a stand-alone application in Python, available for download at http://github.com/joewandy/peak-grouping-alignment.<p></p&gt

    Unsupervised Bayesian explorations of mass spectrometry data

    Get PDF
    In recent years, the large-scale, untargeted studies of the compounds that serve as workers in the cell (proteins) and the small molecules involved in essential life-sustaining chemical processes (metabolites) have provided insights into a wide array of fields, such as medical diagnostics, drug discovery, personalised medicine and many others. Measurements in such studies are routinely performed using liquid chromatography mass spectrometry (LC-MS) instruments. From these measurements, we obtain a set of peaks having mass-to-charge, retention time (RT) and intensity values. Before further analysis is possible, the raw LC-MS data has to be processed in a data pre-preprocessing pipeline. In the alignment step of the pipeline, peaks from multiple LC-MS measurements have to be matched. In the identification step, the identity of unknown compounds in the sample that generate the observed peaks have to be assigned. Using tandem mass spectrometry, fragmentation peaks characteristic to a compound can be obtained and used to help establish the identity of the compound. Alignment and identification are challenging because the true identities of the entire set of compounds in the sample are unknown, and a single compound can produce many observed peaks, each with a potential drift in its retention time value. These observed peaks are not independent as they can be explained as being generated by the same compound. The aim of this thesis is to introduce methods to group these related peaks and to use these groupings to improve alignment and assist in identification during data pre-processing. Firstly, we introduce a generative model to group related peaks by their retention time. This information is used to influence direct-matching alignment, bringing related peak groups closer during matching. Investigations using benchmark datasets reveal that improved alignment performance is obtained from this approach. Next, we also consider mass information in the grouping process, resulting in PrecursorCluster, a model that performs the grouping of related peaks in metabolomics by their explainable mass relationships, RT and intensity values. Through a second-stage process that matches these related peak groups, peak alignment is produced. Experiments on benchmark datasets show that an improved alignment performance is obtained, while uncertainties in matched peaksets can also be extracted from the method. In the next section, we expand upon this two-stage method and introduce HDPAlign, a model that performs the clustering of related peaks within and across multiple LC-MS runs at once. This allows for matched peaksets and their respective uncertainties to be naturally extracted from the model. Finally, we look at fragmentation peaks used for identification and introduce MS2LDA, a topic model to group related fragmentation features. These groups of related fragmentation features potentially correspond to substructures shared by metabolites and can be used to assist data interpretation during identification. This final section corresponds to a work in progress and points to many interesting avenues for future research

    MetAssign: probabilistic annotation of metabolites from LC–MS data using a Bayesian clustering approach

    Get PDF
    Motivation: The use of liquid chromatography coupled to mass spectrometry (LC–MS) has enabled the high-throughput profiling of the metabolite composition of biological samples. However, the large amount of data obtained can be difficult to analyse and often requires computational processing to understand which metabolites are present in a sample. This paper looks at the dual problem of annotating peaks in a sample with a metabolite, together with putatively annotating whether a metabolite is present in the sample. The starting point of the approach is a Bayesian clustering of peaks into groups, each corresponding to putative adducts and isotopes of a single metabolite.<p></p> Results: The Bayesian modelling introduced here combines information from the mass-to-charge ratio, retention time and intensity of each peak, together with a model of the inter-peak dependency structure, to increase the accuracy of peak annotation. The results inherently contain a quantitative estimate of confidence in the peak annotations and allow an accurate trade off between precision and recall. Extensive validation experiments using authentic chemical standards show that this system is able to produce more accurate putative identifications than other state-of-the-art systems, while at the same time giving a probabilistic measure of confidence in the annotations.<p></p> Availability: The software has been implemented as part of the mzMatch metabolomics analysis pipeline, which is available for download at http://mzmatch.sourceforge.net/

    GraphOmics: an interactive platform to explore and integrate multi-omics data

    Get PDF
    Background: An increasing number of studies now produce multiple omics measurements that require using sophisticated computational methods for analysis. While each omics data can be examined separately, jointly integrating multiple omics data allows for deeper understanding and insights to be gained from the study. In particular, data integration can be performed horizontally, where biological entities from multiple omics measurements are mapped to common reactions and pathways. However, data integration remains a challenge due to the complexity of the data and the difficulty in interpreting analysis results. Results: Here we present GraphOmics, a user-friendly platform to explore and integrate multiple omics datasets and support hypothesis generation. Users can upload transcriptomics, proteomics and metabolomics data to GraphOmics. Relevant entities are connected based on their biochemical relationships, and mapped to reactions and pathways from Reactome. From the Data Browser in GraphOmics, mapped entities and pathways can be ranked, sorted and filtered according to their statistical significance (p values) and fold changes. Context-sensitive panels provide information on the currently selected entities, while interactive heatmaps and clustering functionalities are also available. As a case study, we demonstrated how GraphOmics was used to interactively explore multi-omics data and support hypothesis generation using two complex datasets from existing Zebrafish regeneration and Covid-19 human studies. Conclusions: GraphOmics is fully open-sourced and freely accessible from https://graphomics.glasgowcompbio.org/. It can be used to integrate multiple omics data horizontally by mapping entities across omics to reactions and pathways. Our demonstration showed that by using interactive explorations from GraphOmics, interesting insights and biological hypotheses could be rapidly revealed

    Ms2lda.org: web-based topic modelling for substructure discovery in mass spectrometry

    Get PDF
    Motivation: We recently published MS2LDA, a method for the decomposition of sets of molecular fragment data derived from large metabolomics experiments. To make the method more widely available to the community, here we present ms2lda.org, a web application that allows users to upload their data, run MS2LDA analyses and explore the results through interactive visualisations. Results: Ms2lda.org takes tandem mass spectrometry data in many standard formats and allows the user to infer the sets of fragment and neutral loss features that co-occur together (Mass2Motifs). As an alternative workflow, the user can also decompose a dataset onto predefined Mass2Motifs. This is accomplished through the web interface or programmatically from our web service

    R package for statistical inference in dynamical systems using kernel based gradient matching: KGode

    Get PDF
    Many processes in science and engineering can be described by dynamical systems based on nonlinear ordinary differential equations (ODEs). Often ODE parameters are unknown and not directly measurable. Since nonlinear ODEs typically have no closed form solution, standard iterative inference procedures require a computationally expensive numerical integration of the ODEs every time the parameters are adapted, which in practice restricts statistical inference to rather small systems. To overcome this computational bottleneck, approximate methods based on gradient matching have recently gained much attention. The idea is to circumvent the numerical integration step by using a surrogate cost function that quantifies the discrepancy between the derivatives obtained from a smooth interpolant to the data and the derivatives predicted by the ODEs. The present article describes the software implementation of a recent method that is based on the framework of reproducing kernel Hilbert spaces. We provide an overview of the methods available, illustrate them on a series of widely used benchmark problems, and discuss the accuracy–efficiency trade-off of various regularization methods

    In silico optimization of mass spectrometry fragmentation strategies in metabolomics

    Get PDF
    Liquid chromatography (LC) coupled to tandem mass spectrometry (MS/MS) is widely used in identifying small molecules in untargeted metabolomics. Various strategies exist to acquire MS/MS fragmentation spectra; however, the development of new acquisition strategies is hampered by the lack of simulators that let researchers prototype, compare, and optimize strategies before validations on real machines. We introduce Virtual Metabolomics Mass Spectrometer (ViMMS), a metabolomics LC-MS/MS simulator framework that allows for scan-level control of the MS2 acquisition process in silico. ViMMS can generate new LC-MS/MS data based on empirical data or virtually re-run a previous LC-MS/MS analysis using pre-existing data to allow the testing of different fragmentation strategies. To demonstrate its utility, we show how ViMMS can be used to optimize N for Top-N data-dependent acquisition (DDA) acquisition, giving results comparable to modifying N on the mass spectrometer. We expect that ViMMS will save method development time by allowing for offline evaluation of novel fragmentation strategies and optimization of the fragmentation strategy for a particular experiment

    Deciphering complex metabolite mixtures by unsupervised and supervised substructure discovery and semi-automated annotation from MS/MS spectra

    Get PDF
    Complex metabolite mixtures are challenging to unravel. Mass spectrometry (MS) is a widely used and sensitive technique to obtain structural information on complex mixtures. However, just knowing the molecular masses of the mixture’s constituents is almost always insufficient for confident assignment of the associated chemical structures. Structural information can be augmented through MS fragmentation experiments whereby detected metabolites are fragmented giving rise to MS/MS spectra. However, how can we maximize the structural information we gain from fragmentation spectra? We recently proposed a substructure-based strategy to enhance metabolite annotation for complex mixtures by considering metabolites as the sum of (bio)chemically relevant moieties that we can detect through mass spectrometry fragmentation approaches. Our MS2LDA tool allows us to discover - unsupervised - groups of mass fragments and/or neutral losses termed Mass2Motifs that often correspond to substructures. After manual annotation, these Mass2Motifs can be used in subsequent MS2LDA analyses of new datasets, thereby providing structural annotations for many molecules that are not present in spectral databases. Here, we describe how additional strategies, taking advantage of i) combinatorial in-silico matching of experimental mass features to substructures of candidate molecules, and ii) automated machine learning classification of molecules, can facilitate semi-automated annotation of substructures. We show how our approach accelerates the Mass2Motif annotation process and therefore broadens the chemical space spanned by characterized motifs. Our machine learning model used to classify fragmentation spectra learns the relationships between fragment spectra and chemical features. Classification prediction on these features can be aggregated for all molecules that contribute to a particular Mass2Motif and guide Mass2Motif annotations. To make annotated Mass2Motifs available to the community, we also present motifDB: an open database of Mass2Motifs that can be browsed and accessed programmatically through an Application Programming Interface (API). MotifDB is integrated within ms2lda.org, allowing users to efficiently search for characterized motifs in their own experiments. We expect that with an increasing number of Mass2Motif annotations available through a growing database we can more quickly gain insight in the constituents of complex mixtures. That will allow prioritization towards novel or unexpected chemistries and faster recognition of known biochemical building blocks

    Topic modeling for untargeted substructure exploration in metabolomics

    Get PDF
    The potential of untargeted metabolomics to answer important questions across the life sciences is hindered due to a paucity of computational tools that enable extraction of key biochemically relevant information. Available tools focus on using mass spectrometry fragmentation spectra to identify molecules whose behavior suggests they are relevant to the system under study. Unfortunately, fragmentation spectra cannot identify molecules in isolation, but require authentic standards or databases of known fragmented molecules. Fragmentation spectra are, however, replete with information pertaining to the biochemical processes present; much of which is currently neglected. Here we present an analytical workflow that exploits all fragmentation data from a given experiment to extract biochemically-relevant features in an unsupervised manner. We demonstrate that an algorithm originally utilized for text-mining, Latent Dirichlet Allocation, can be adapted to handle metabolomics datasets. Our approach extracts biochemically-relevant molecular substructures (‘Mass2Motifs’) from spectra as sets of co-occurring molecular fragments and neutral losses. The analysis allows us to isolate molecular substructures, whose presence allows molecules to be grouped based on shared substructures regardless of classical spectral similarity. These substructures in turn support putative de novo structural annotation of molecules. Combining this spectral connectivity to orthogonal correlations (e.g. common abundance changes under system perturbation) significantly enhances our ability to provide mechanistic explanations for biological behavior

    PiMP my metabolome:An integrated, web-based tool for LC-MS metabolomics data

    Get PDF
    Summary: The Polyomics integrated Metabolomics Pipeline (PiMP) fulfils an unmet need in metabolomics data analysis. PiMP offers automated and user-friendly analysis from mass spectrometry data acquisition to biological interpretation. Our key innovations are the Summary Page, which provides a simple overview of the experiment in the format of a scientific paper, containing the key findings of the experiment along with associated metadata; and the Metabolite Page, which provides a list of each metabolite accompanied by ‘evidence cards’, which provide a variety of criteria behind metabolite annotation including peak shapes, intensities in different sample groups and database information. Availability: PiMP is available at http://polyomics.mvls.gla.ac.uk, and access is freely available on request. 50 GB of space is allocated for data storage, with unrestricted number of samples and analyses per user. Source code is available at https://github.com/RonanDaly/pimp and licensed under the GPL
    corecore